Musicbrainz reduce memory used by processing by chunks #165

Yueqiao12Zhang · 2024-08-22T15:48:02Z

The bug: I moved the code for reading the JSON elsewhere but forgot to delete the original part, so Python read the JSON twice.
After fixing this, I read the file metadata and parse it in chunks of 4096 records. This solves the problem.
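
A minimal sketch of the chunked reading described above, assuming the input is a JSON Lines file with one record per line (as the commit title "refactor: read jsonl in chunks" suggests). The function name `read_jsonl_in_chunks` is hypothetical and is not taken from convert_to_csv.py.

```python
import json
from itertools import islice
from typing import Iterator

CHUNK_SIZE = 4096  # records per chunk; the value named in the PR description


def read_jsonl_in_chunks(path: str, chunk_size: int = CHUNK_SIZE) -> Iterator[list]:
    """Yield lists of at most chunk_size parsed records from a JSON Lines file."""
    with open(path, encoding="utf-8") as jsonl_file:
        while True:
            # islice pulls at most chunk_size lines, so the whole file
            # is never held in memory at once.
            lines = list(islice(jsonl_file, chunk_size))
            if not lines:
                break
            # Skip blank lines; parse every other line as one JSON record.
            yield [json.loads(line) for line in lines if line.strip()]
```

Each yielded list could then be passed to a per-chunk converter such as `convert_dict_to_csv`, so that only one chunk of records is alive at a time.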

dchiller · 2024-08-22T18:01:54Z

Can you explain in the PR description how this solves the issue?

Or this commit (fix: delete buggy code) would be a great place to have a message about what the buggy code was!

ahankinson · 2024-08-22T21:29:44Z

You could also try using a more efficient JSON library like ujson or orjson. The built-in json library is well known to be fine for quick tasks but not great for speed or memory use.
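
A minimal sketch of that swap, assuming orjson may or may not be installed. The helper name `parse_record` is hypothetical; whether the change actually helps for this dataset would need to be measured.

```python
import json

try:
    import orjson  # third-party package; assumed optional here

    def parse_record(line: str) -> dict:
        # orjson.loads accepts str or bytes and returns plain Python objects.
        return orjson.loads(line)

except ImportError:

    def parse_record(line: str) -> dict:
        # Fall back to the standard library when orjson is not installed.
        return json.loads(line)
```

Keeping the parser behind one small function means the rest of the script does not need to know which library is in use.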

dchiller · 2024-08-26T14:24:47Z

musicbrainz/csv/convert_to_csv.py


+CHUNK_SIZE = 4096


Could this go at the top with the other configuration constant?

Can you also add a comment on why you chose 4096?

I don't know exactly. I think this could be any number that is not too large, but ChatGPT and Stack Overflow examples all use something like 4096 or 8192, so I decided to use 4096.

So, comment that in the code: "4096 was chosen because ChatGPT and StackOverflow examples typically use 4096 or 8192."
In general, always comment (justify) why you chose a value or a method as opposed to some other option.

dchiller · 2024-08-30T11:21:01Z

@candlecao You should commit 805c464 to a different branch...

ahankinson · 2024-09-09T15:49:26Z

musicbrainz/csv/convert_to_csv.py

@@ -195,7 +195,7 @@ def convert_dict_to_csv(dictionary_list: list) -> None:
            with open(
                "temp.csv", mode="a", newline="", encoding="utf-8"
            ) as csv_records:
-                writer_records = csv.writer(csv_records)
+                writer_records = csv.writer(csv_records, delimiter="\t")


This is no longer comma-separated now... it's a different format.
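
For context, Python's `csv.writer` already quotes any field that contains the delimiter or the quote character, so a comma inside a value does not by itself force a switch to tabs. A minimal sketch with made-up sample values (the original escaping problem is not shown in this thread, so this only illustrates the default behaviour):

```python
import csv
import io

# Made-up sample row: one field contains a comma, one contains a quote character.
row = ["Hello, World", 'say "hi"', "plain"]

buffer = io.StringIO()
# Defaults: delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
csv.writer(buffer).writerow(row)
written = buffer.getvalue()  # '"Hello, World","say ""hi""",plain\r\n'

# Reading the line back recovers the original three fields.
parsed = next(csv.reader(io.StringIO(written)))
assert parsed == row
```

If tab-separated output is what is wanted, naming the file with a .tsv suffix keeps the format and the extension consistent.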

dchiller · 2024-09-27T17:08:11Z

@Yueqiao12Zhang @candlecao

I'm not sure what the status of this is. Is it ready to review again? I'm confused by the number of unrelated commits!

Yueqiao12Zhang and others added 15 commits August 19, 2024 10:41

refactor: more ignored keys

aca1887

refactor: avoid new allocation

110263b

doc: explain if statement

f9d2091

refactor: extract and convert in chunks

f72eb1e

fix: write header

48adb26

refactor: if not first level, then don't extract name

87d9fe5

refactor: refresh values list

6659d28

Update convert_to_csv.py

e79de48

Revert "Update convert_to_csv.py"

6c565be

This reverts commit e79de48.

refactor: read jsonl in chunks

088c40d

Merge branch 'main' into musicbrainz-reduce-memory-used

9ff292f

test: add print tests

2cbea9b

Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…

7ae3914

…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used

fix: delete buggy code

04b3c41

style: delete print test statements

8203c71

Yueqiao12Zhang self-assigned this Aug 22, 2024

Yueqiao12Zhang linked an issue Aug 22, 2024 that may be closed by this pull request

Cannot open the file: MusicBrainz releases #155

Closed

Yueqiao12Zhang requested a review from dchiller August 22, 2024 16:39

Yueqiao12Zhang and others added 7 commits August 23, 2024 09:39

fix: memory bug and header writting

6e51f96

Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…

e23ed58

…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used

Merge branch 'main' into musicbrainz-reduce-memory-used

740d365

Merge branch 'main' into musicbrainz-reduce-memory-used

c0d1c78

test: delete unused test files

6c6f920

Update genre.csv

99ff276

fix: header problem by making a temp csv

e757df8

dchiller reviewed Aug 26, 2024

Update mapping.json

805c464

change the type of recording from Q49017950 to Q482994

fix: filename bug, readlines bug, opening wrong file bug

29dc943

Yueqiao12Zhang linked an issue Sep 6, 2024 that may be closed by this pull request

Musicbrainz conversion: cannot append header after chunk reading #167

Closed

fix: auto escape by using \t not comma

28243a5

ahankinson reviewed Sep 9, 2024

Yueqiao12Zhang and others added 18 commits September 13, 2024 09:55

feat: add quotechar to distinguish ", and , separator

37201ab

refactor: temp should be a tsv

72b92e6

fix: correct a few temp.csv error

b64da8a

fix: file suffix bug

241dea3

Merge branch 'musicbrainz-reduce-memory-used' of https://github.com/D…

956f924

…DMAL/linkedmusic-datalake into musicbrainz-reduce-memory-used

mappings: update by Junjun

3146891

Update .gitignore

b58e618

test: only include genre.csv since it's not updated often and is slow…

d6e48bf

… to update

test: remove old test files

23a6861

Update .gitignore

222e715

Update .gitignore

dcaf73e

Merge branch 'main' into musicbrainz-reduce-memory-used

63c8914

doc: update lock file based on pyproject.toml

8d4ef18

refactor: visualize file processing

ed26803

feat: add multi-file export option

622cefe

mapping: add reconciled columns

8e99189

doc: add other reconciled columns

84e4b1b

doc: add specification about multi-file output

798d448

candlecao and others added 5 commits October 21, 2024 11:19

update mapping.json

13e8601

refactor: add type recognition to musicbrainz

6c6401e

ignore: add ignore separate ttls

d380d46

Merge branch 'main' into musicbrainz-reduce-memory-used

c4be2a5

test: remove test print

9fe4df5

candlecao approved these changes Oct 25, 2024

candlecao merged commit dbcee82 into main Oct 25, 2024

Yueqiao12Zhang commented Aug 22, 2024

dchiller commented Aug 22, 2024

ahankinson commented Aug 22, 2024

dchiller Aug 26, 2024

fujinaga Aug 26, 2024

Yueqiao12Zhang Aug 30, 2024

fujinaga Aug 30, 2024

dchiller commented Aug 30, 2024

ahankinson Sep 9, 2024

dchiller commented Sep 27, 2024

